{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第14课：动手实战中文命名实体提取"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "命名实体识别（Named EntitiesRecognition，NER）是自然语言处理的一个基础任务。其目的是识别语料中人名、地名、组织机构名等命名实体"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "基于统计机器学习的方法主要包括隐马尔可夫模型（HiddenMarkovMode，HMM）、最大熵（MaxmiumEntropy，ME）、支持向量机（Support VectorMachine，SVM）、条件随机场（ConditionalRandom Fields，CRF）等。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import jieba\n",
    "import jieba.analyse\n",
    "import jieba.posseg as posg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from the default dictionary ...\n",
      "Loading model from cache /var/folders/yw/k7z_d_3567g16ss9plk47x9w0000gn/T/jieba.cache\n",
      "Loading model cost 1.408 seconds.\n",
      "Prefix dict has been built succesfully.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "上市 1.437080435586\n",
      "上线 0.820694551317\n",
      "奇迹 0.775434839431\n",
      "互联网 0.712189275429\n",
      "平台 0.6244340485550001\n",
      "企业 0.422177218495\n",
      "美国 0.415659623166\n",
      "问题 0.39635135730800003\n"
     ]
    }
   ],
   "source": [
    "sentence=u'''上线三年就成功上市,拼多多上演了互联网企业的上市奇迹,却也放大平台上存在的诸多问题，拼多多在美国上市。'''\n",
    "kw=jieba.analyse.extract_tags(sentence,topK=10,withWeight=True,allowPOS=('n','ns'))\n",
    "for item in kw:\n",
    "    print(item[0],item[1])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "上市 1.0\n",
      "奇迹 0.572687398431635\n",
      "企业 0.5710407272273452\n",
      "互联网 0.5692560484441649\n",
      "上线 0.23481844682115297\n",
      "美国 0.23481844682115297\n"
     ]
    }
   ],
   "source": [
    "kw=jieba.analyse.textrank(sentence,topK=20,withWeight=True,allowPOS=('ns','n'))\n",
    "for item in kw:\n",
    "    print(item[0],item[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyhanlp import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sentence=u'''上线三年就成功上市,拼多多上演了互联网企业的上市奇迹,却也放大平台上存在的诸多问题，拼多多在美国上市。'''\n",
    "analyzer = PerceptronLexicalAnalyzer()\n",
    "segs = analyzer.analyze(sentence)\n",
    "arr = str(segs).split(\" \")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def get_result(arr):\n",
    "    re_list = []\n",
    "    ner = ['n','ns']\n",
    "    for x in arr:\n",
    "        temp = x.split(\"/\")\n",
    "        if(temp[1] in ner):\n",
    "            re_list.append(temp[0])\n",
    "    return re_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['上线', '互联网', '企业', '奇迹', '平台', '问题', '美国']\n"
     ]
    }
   ],
   "source": [
    "result = get_result(arr)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
